THE RELIABILITY OF STANDARD SCORES IN
ADDING ABILITY 



ARTHUR S. OTIS and PERCY E. DAVIDSON 
Leland Stanford Junior University 



In [1] the effort to set up standard scores in arithmetic, or standards
in any school subject for that matter, it is necessary to keep in mind
the different purposes to which such scores may be put. Stone [2]
measured school systems. Courtis [3] invites us "to measure the
efficiency of the entire school, not the individual ability of the few." 
Others would doubtless urge "finding the individual." Perhaps 
the practical uses of scales or standards in the branches of the 
curriculum may be reduced to three. There is, first, the great 
desirability from the administrative point of view of determining 
the efficiency of instruction in groups, large and small. The pur- 
pose here is to learn of the effect of school conditions as exhibited 
in the achievements of classes, schools, and systems in comparison
one with another. The idiosyncrasies of individuals are of concern 
only when they tend to differentiate a group from what is usual 
among school groups. There is, also, quite as important a need of 
standards in properly locating an individual child either as an
entrant to a class or as a candidate for promotion. And standards 
are required for the purpose of disclosing the peculiar weaknesses 
which act as causes of backwardness in the more general abilities. 
It is plain that a standard score might be derived in such fashion as 
to serve one of these purposes without adequately satisfying the 
others. 

[1] The writers of the paper are under obligations to the following persons for
considerate co-operation in the gathering of data: Superintendent Alexander Sheriffs, of
San Jose, Cal.; Principals L. Bruch, A. L. Dornberger, V. Dornberger, J. E. Hancock,
J. Manzer, R. A. Lee, of the same place; Principal M. L. Trace, of Hester; Dr.
Margaret Shallenberger and Miss Henrietta Riebson, of the Normal Training School,
San Jose.

[2] Arithmetical Abilities and Some Factors Determining Them. Columbia University.

[3] "Standard Scores in Arithmetic," Elementary School Teacher, November, 1911.

The usefulness of standards for the analysis of pedagogical
backwardness may not be obvious. Suppose that by the application
of a standard test in addition it appears that a certain child is
seriously retarded in this general ability. It becomes necessary to 
discover just wherein his weakness consists before subjecting him 
to a course of corrective training. Our knowledge of the complexity 
of mental processes makes it likely that the factors conditioning 
even so simple an ability as adding are very numerous. Even a 
casual analysis shows four subsidiary processes that may, and in 
fact do, vary independently, weakness in any one of which may serve 
to pull the child below the average in adding. These are: the 
ability to make all the possible combinations of single digits with 
average rapidity and accuracy; the ability to hold securely in mind 
numbers of two- and three-place value while adding in the next 
digit or combination of digits in "running up" a column; the ability
to "carry" from one column what is "over" to the column next on 
the left; and the ability to write down figures with average speed. 
The further analysis of these four constituent processes would 
doubtless lead into intricate psychological issues, which may or may 
not be required practically for the economical use of time in cor- 
rective treatment. But, in any case, it appears that before any- 
thing positive can be said of the weakness of the backward child in 
question, norms for each of these processes must be found. Once 
they are known, the performance of the child may be compared 
with them, and specially devised exercises may be employed to 
bring him up to standard at just those points, and those only, with 
which his backwardness in adding is associated. Practice in making 
the addition of single digits is not appropriate in a case where the 
difficulty consists in holding number meanings in mind. The 
processes may be as unlike as association and attention span. If 
improvement does not come from the use of the exercise selected, 
the situation calls for further analysis, one possible outcome being 
the discovery of fundamental defect. Norms or standards of per- 
formance are of course required for each factor distinguished. 

AIMS OF THE PRESENT STUDY AND SUMMARY 

From what has been said it is apparent that there are at least 
three kinds of pedagogical tests, which may be characterized 
severally, after their unlike functions, as administrative, grading, 
and diagnostic tests. The first kind relates to groups, the last two
relate to individuals. The third is comparable to the clinical 
test for mentality, except that it aims frankly at the analysis of 
specialized training rather than inherent capacity or "general 
intelligence." It may well happen, of course, that a given test 
may serve all three purposes, but it need not, and doubtless many 
tests will be devised for one of these purposes only. 

This study took its departure from an attempt to use the 
Courtis "Standard Scores in Arithmetic" to measure arithmetical 
backwardness in a class of backward boys. These boys were at 
quite different points in their arithmetical training, and it was 
desired to grade them with reference to the average attainments of 
normal public-school children in order to estimate the time needed 
to bring them up with their classes. It was necessary, too, to locate 
their arrests and peculiarities as closely as might be so that their
further training could be applied where it would do the most good.
The Courtis tests, although used chiefly for the measurement of 
groups, are recommended for individuals, and several of them are 
of diagnostic value in the sense that they test abilities that are 
fairly elementary and at the same time highly important elements 
in the more inclusive abilities of which they are a part. 

The Courtis scores were derived from a single performance from 
each individual. They are averages from the work of some nine 
thousand children in different parts of the country, and they may 
be taken as true measures of the average single performance in the 
processes they examine. Undoubtedly the average of single trials 
from large groups may be safely compared with them. But it 
seemed questionable, considering the possibility of accidental error 
from one performance, to measure individuals and small groups by 
means of them. It thus became necessary to consider the signifi- 
cance of one trial as an index to the standing of any individual in 
the abilities in question. The Courtis Test No. 1 on the addition 
combinations was chosen for examination. 

This study, then, is an attempt to ascertain: first, the approxi- 
mate reliability of a single score as a measure of an individual's 
status in the ability to make the addition combinations in writing; 
second, the number of scores required of an individual in order to 
obtain a desirable degree of reliability; and third, the size of a
group that may be measured with reasonable accuracy by means of 
one score from each member of the group. 

With reference to the first aim it should be explained that by 
"unreliability" is meant the probable deviation of an actual single 
score from a hypothetical single score which would accurately place 
the individual on a scale of status. This hypothetical score was 
derived as follows. Twenty-five tests were given each individual, 
and the median, or the thirteenth highest score, was regarded as the 
measure of the individual's ability to write the addition combina- 
tions. It will be seen at once that if an individual began with a 
score of 55 combinations and attained a median of 75, his first score 
could not be said to be in error to the amount of 20 combinations, 
due to the fact that the median is uniformly higher than the first as 
a result of practice. It was therefore assumed that the middle 
measure of all the first scores of 202 children (51 combinations) 
would represent the same relative position on the scale of status as 
the middle measure of all the medians (70 combinations). The 
hypothetical first score was consequently considered to be for each 
individual fifty-one seventieths of his median. The deviation of 
the actual first score from the hypothetical first was then found in 
each case. 
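The computation just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the list of twenty-five scores is invented so that the first score is 55 and the median 75, as in the example in the text. The constants 51 and 70 are the group medians reported above.

```python
from statistics import median

# Group medians reported in the paper: 51 for first scores, 70 for medians of 25.
FIRST_SCORE_MEDIAN = 51
OVERALL_MEDIAN = 70


def hypothetical_first_score(scores):
    """Scale an individual's median down to the level of a first trial."""
    return median(scores) * FIRST_SCORE_MEDIAN / OVERALL_MEDIAN


def first_score_deviation(scores):
    """Absolute deviation of the actual first score from the hypothetical one."""
    return abs(scores[0] - hypothetical_first_score(scores))


# Invented scores (an assumption): first score 55, thirteenth highest 75.
example_scores = [55] + [70] * 11 + [75] + [80] * 12

print(round(hypothetical_first_score(example_scores), 2))
print(round(first_score_deviation(example_scores), 2))
```

Under this scaling the child who begins at 55 and reaches a median of 75 is in error by well under one combination on the first trial, not by 20.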

This deviation in the 202 cases examined varied from no com- 
binations up to 26, with a middle measure of four combinations and 
an average of slightly over five. That is to say, any individual's 
score from one performance has one chance in two of deviating less 
than four combinations, and one chance in four of deviating ten or 
more, from that measure which would properly place him on a scale 
of status in this ability. This would seem to make one performance 
altogether too uncertain as a test of an individual's standing for 
purposes either of grading or of diagnosis. 

In regard to the second purpose of the study it was found that 
safely to measure the ability of an eighth-grade child to write the 
addition combinations probably twenty-five trials are required. 
This number will practically assure a result correct within three 
combinations. If a smaller reliability will suffice for any purpose 
the appropriate number of trials may be noted from the table
given farther on. 

As to the reliability of a single score from each member of a 
group for purposes of group measurement, on the average, measures 
of groups of 25 were in error approximately 1.7 combinations, 
groups of 50 were in error approximately 1.2 combinations, groups
of 100 approximately 0.7 combination. This applies to eighth- 
grade groups. 

PROCEDURE 

Printed blanks for making the tests were prepared. The print 
and arrangement of the Courtis Test No. 1 were exactly duplicated 
for the first minute. This test consists of 120 combinations, 
including the hundred possible ones and twenty duplicates arranged 
in groups of five such that each five presented approximately the 
same difficulty as any other five. Four other slightly altered 
arrangements of the same test were made to occupy the remaining 
four of the five minutes devoted to adding on each of the five days. 

Two hundred and seventy eighth-grade children in the eight 
larger grammar schools of San Jose, Cal., were given the tests. The 
desirability of having all the tests given by the same person — one 
of the writers — made it impossible to present them at the same time 
in all the schools. Each school was tested at about the same time 
on the succeeding days with insignificant exceptions. The one 
marked alteration of this order was with some fifty children in one 
school who were given the tests for the fifth day after a lapse of 
fourteen days. An examination of the returns from this group 
revealed a barely perceptible fall for which slight allowance was 
made. A formula was memorized and spoken to each class at the 
beginning, drawn in such a way as to insure a sympathetic and 
complete understanding of the test and a quick response to the 
signals. The children were encouraged to write as many com- 
binations in the minute as they could without error, but emphasis 
was placed upon accuracy. Throughout incentive prevailed, as it 
seemed, from motives of individual and school rivalry and personal 
improvement. There was no evidence of fatigue. Time was taken 
by stop-watch. Close observation of the pupils showed that in a 
few cases figures were improperly added on after the final signal
but it is felt that the amount of this that was unnoticed by the 
examiners and uncorrected was really negligible. A rest of about 
thirty seconds occurred between each two trials. 

From a preliminary study it had been concluded that five
minutes' adding with rests of about half a minute avoided fatigue 
and preserved incentive satisfactorily. For various reasons twenty- 
five was chosen as the number of the tests from which a measure of 
status might be derived. This point on the practice curve was 
considered to be sufficiently far advanced practically to eliminate 
accidental error and yet not so far advanced as to involve learning 
ability unduly. Any number must be, perforce, the result both of 
some practice and of some accidental error. As to the effect of 
practice, it is evident no sharp line can be drawn between status 
and learning ability, in view of the fact that practice effects are 
marked after the first performance. It is merely a question of 
arbitrarily deciding upon the amount of practice which shall fairly 
be admitted in determining status. Surely children are entitled to
the amount of improvement they can make in an hour's time, before 
being classified for a course of training of any considerable duration. 

THE DATA 

The number of combinations written each minute and the num- 
ber of errors made were ascertained for each individual. The 
papers of twenty children who missed the fifth day were included. 
The decision as to how to deal with errors was made with difficulty. 
Courtis regards them as negligible in his test. The present writers 
supposed that such haste as would result in errors would at the same 
time increase the number of combinations written per minute, and 
various attempts were made to evaluate errors in terms of time, 
that is, to find the smaller number of combinations which might be 
assumed to have been written in the event that errors had not been 
made. It was then a matter of surprise to discover that scores 
containing many errors did not average as many combinations as 
scores without them, which seems to point to some third factor as 
being responsible for errors and smaller scores alike. For lack of a 
better term this factor may be called the "predisposition of the 
moment." In any case, because of the smaller scores of the
children making many errors and an observed erratic character of 
their performance, a somewhat complicated process of elimination 
was substituted. 

The papers of any individual who made more than two errors in 
any one minute throughout the twenty-five were set aside, provided 
that in the event of not more than three errors appearing in one
minute the results might be included when there were not more than 
five errors in one day and not more than ten in the five days. Some 
half-dozen children escaped on this provision. The papers of any 
individual who made more than five errors in any one day and more 
than fifteen in the five days were set aside without exception. No 
errors at all was thought to be too exacting, and inconsistent with 
our adult practice in which we work with moderate care and check 
for mistakes. The papers thus selected were considered to rep- 
resent a reasonable degree of accuracy. The result of the selection 
was as follows:

              25 Tests    20 Tests    Total
              (5 Days)    (4 Days)

Accepted         182          20       202
Rejected          65           3        68
Total            247          23       270
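One possible reading of the elimination rule described before the table is sketched below. The wording of the rule leaves room for interpretation, so the order of the checks, the function name `accept_paper`, and the input layout (a list of days, each a list of per-minute error counts) are all assumptions made for illustration and not the writers' stated procedure.

```python
def accept_paper(errors_by_day):
    """Return True if a pupil's papers pass the accuracy screen.

    errors_by_day: list of days, each a list of error counts per minute.
    This encodes one reading of the rule; the thresholds are from the text.
    """
    per_minute = [count for day in errors_by_day for count in day]
    per_day = [sum(day) for day in errors_by_day]
    total = sum(per_day)
    worst_minute = max(per_minute)

    # Set aside without exception: more than five errors in a day
    # and more than fifteen over all days.
    if max(per_day) > 5 and total > 15:
        return False
    # No minute with more than two errors: accepted.
    if worst_minute <= 2:
        return True
    # Proviso: up to three errors in a minute is tolerated when there are
    # not more than five errors in one day and not more than ten in all.
    return worst_minute <= 3 and max(per_day) <= 5 and total <= 10


clean = [[0] * 5 for _ in range(5)]
print(accept_paper(clean))
```

A stricter or looser reading would change only the three conditions, which is why they are kept as separate, commented steps.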



Figs. 1, 2, and 3 on the accompanying plate show the actual 
scores of three pupils selected to illustrate the possible variation 
among individuals. Of the twenty-five scores of each of the 202 
individuals were found: the average of the first two scores, the 
average of the first three, four, five; the median of the first seven, 
of the first thirteen, nineteen, and the whole twenty-five, being re- 
spectively the fourth, seventh, tenth, and thirteenth highest. In 
the cases of the twenty pupils who took the test four days only, the 
twelfth highest was found to be the value which most nearly ap- 
proximated the probable value of the median. These values for 
the three pupils represented in Figs. 1, 2, and 3 are shown as the 
longer lines in Figs. 1a, 2a, and 3a.
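The measures listed above for each pupil can be computed as in the following sketch. The function name and dictionary keys are hypothetical, and the sample scores are invented for illustration; only the choice of measures (averages of the first two to five scores, medians of the first 7, 13, 19, and 25) comes from the text.

```python
from statistics import mean, median


def summary_measures(scores):
    """Compute the per-pupil measures from 25 scores in the order taken."""
    measures = {"first": scores[0]}
    for k in (2, 3, 4, 5):
        measures[f"avg_first_{k}"] = mean(scores[:k])
    # The median of an odd number k of scores is the ((k + 1) / 2)-th highest,
    # i.e. the 4th, 7th, 10th, and 13th highest respectively.
    for k in (7, 13, 19, 25):
        measures[f"median_first_{k}"] = median(scores[:k])
    return measures


# Invented, steadily rising scores (an assumption for illustration only).
sample = list(range(40, 65))
print(summary_measures(sample))
```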

The median of the 202 first scores was found, also the median of 
the averages of the first two, of the first three, and so on, for each of
the above quantities. These are shown in the following table: 

                                                                         Combinations

The median of the 202 first scores                                           51
The median of the 202 averages of the 1st and 2d scores                      54
The median of the 202 averages of the 1st, 2d, and 3d scores                 55½
The median of the 202 averages of the 1st, 2d, 3d, and 4th scores            56½
The median of the 202 averages of the 1st, 2d, 3d, 4th, and 5th scores       57
The median of the 202 medians of the first 7 scores                          60
The median of the 202 medians of the first 13 scores                         64
The median of the 202 medians of the first 19 scores                         67
The median of the 202 medians of all 25 scores                               70
The median of the 202 medians of the actual highest scores                   79

The first scores varied from 28 to 88 combinations, the middle
half falling between 42 and 60. The medians of the twenty-five
scores in the 202 cases varied from 42 to 107 combinations, the
middle half falling between 62 and 80. The quantities of the table
are plotted on the plate (Fig. 4a). From these was constructed an
idealized practice curve (Fig. 4) upon which appear six selected
values.

Hypothetical first scores, or first scores which would locate
each individual upon the scale of status in correct relation to
every other individual, were computed, as explained previously,
by taking fifty-one seventieths of each individual's median.
Hypothetical values of the averages of the first two scores were
computed by taking fifty-four seventieths
of the median in each case, and so on for the other values shown in
the table above. These values for the three individuals represented 
in Figs. 1, 2, and 3 are shown as the shorter lines in Figs. 1a, 2a,
and 3a. 

Figs. 1a, 2a, and 3a may be interpreted thus: In the case of the 
first pupil, from some cause, the first score (the longer line) lacks 26 
combinations of equaling that hypothetical first score (the shorter 
line) which would place him correctly on the scale of status. The 
average of the first two scores brings him within 19 combinations of 
the corresponding hypothetical value, the average of the first three 
scores within 14 combinations of its hypothetical value, and so on. 
The second pupil would vary but little in relative position by any 
number of measurements, so closely does his actual performance 
approximate the hypothetical. The case of the third pupil is quite 
the opposite of the first. Because of slight improvement the first 
score would rank this pupil far too high. 

Out of 182 cases where all twenty-five tests were taken the first 
scores deviated from the hypothetical values by between zero and 
one combination in 26 cases, by between one and two combinations 
in 18 cases, in order as follows: 



Combinations:  0 1 2 3 4 5 6 7 8 9
Cases:         26 18 25 23 17 11 9 8 12

Combinations:  10 11 12 13 14 15 16
Cases:         [illegible] 7 4 4 3 3

Combinations:  17 18 19 20 21 22 23 24 25 26
Cases:         1 1 0 0 0 0 0 1 0 1



The percentage of deviation is not given. It would not augment
the obtained degree of accuracy appreciably. The values of the
table are plotted in Fig. 5 of the plate. Fifty per cent of the first
scores were in error four or more combinations, 10 per cent were in
error 11 combinations, one score was in error 26 combinations.
These and the amounts of error for the average of the first two
scores, the first three, and so on, are shown in the following table.



Cases                                   50 Per cent        10 Per cent

First score                             4 combinations     11 combinations
Average of 1st and 2d scores            3.5                9
Average of 1st, 2d, 3d                  3                  8
Average of 1st, 2d, 3d, 4th             2.85               7
Average of 1st, 2d, 3d, 4th, 5th        2.76               7
Median of first 7 scores                2.55               6
Median of first 13 scores               2.15               5
Median of first 19 scores               1.5                3
Median of first 25 scores               1 (probably)       2 (probably)

One Case (surviving entries, in order; their row assignment is not
recoverable): 25, 20, 19, 18, 10, 5, 3 (probably)
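The three columns of the table correspond to three summaries of a list of deviations: the middle deviation, the deviation reached by the worst tenth of cases, and the single largest deviation. A minimal sketch follows. The function name, the exact cut-off used for the "10 per cent" value, and the sample data are assumptions, since the paper does not state how the 10 per cent point was read off.

```python
from statistics import median


def deviation_summary(deviations):
    """Summarize deviations as (50 per cent, 10 per cent, one case)."""
    ordered = sorted(deviations)
    n = len(ordered)
    worst_tenth_count = max(1, n // 10)
    return {
        "fifty_per_cent": median(ordered),
        # Smallest deviation among the worst tenth of cases (assumed cut-off).
        "ten_per_cent": ordered[-worst_tenth_count],
        "one_case": ordered[-1],
    }


# Invented deviations 0..19, for illustration only.
print(deviation_summary(list(range(20))))
```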



The values in the 50 per cent column of the table are plotted in
Fig. 6 of the plate where each value shows the median deviation
of the value represented directly above in Fig. 4a. It appears
that any individual eighth-grade child tested in the addition
combinations by one trial stands one chance in ten of being displaced
from his true standing by 11 combinations. From two trials the
displacement would be nine combinations, and so on, as indicated
in the 10 per cent column of the table.

The procedure for finding the reliability of group averages based 
upon one score from each member of the group was as follows. The 
averages of all the firsts and of all the medians in the 202 cases were 
found. They were respectively 51.5 and 69 combinations and form 
a ratio of 0.748. Two hundred of the individuals, chosen by 
chance, were distributed into eight groups of twenty-five. These 
groups were treated in the way individuals had been treated in the 
first part of the study. The averages of the first scores and of the 
medians of twenty-five scores were found for each group of twenty- 
five. Hypothetical averages of first scores were found by taking 
0.748 of the actual averages of the medians. The average deviation
of the hypothetical averages of first scores from the actual averages
of first scores for the eight groups was approximately 1.7 combinations.
The two hundred individuals were then distributed into
eight chance groups of fifty and eight chance groups of one hundred.
The approximate average deviation in these cases was respectively
1.2 and 0.7 combinations.
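The group procedure can be sketched as follows. This is an illustration under assumptions, not the writers' exact method: the function name and the seeded shuffle are hypothetical, and the sketch divides the pupils once into non-overlapping chance groups, whereas the paper does not say how eight groups of fifty or of one hundred were drawn from two hundred pupils. Each pupil is represented by a pair (first score, median of twenty-five scores).

```python
import random
from statistics import mean


def group_average_error(pairs, group_size, seed=0):
    """Average deviation of group first-score averages from hypothetical ones."""
    # Ratio of the overall average first score to the overall average median
    # (0.748 in the paper).
    ratio = mean([f for f, _ in pairs]) / mean([m for _, m in pairs])

    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # chance assignment to groups
    groups = [
        shuffled[i:i + group_size]
        for i in range(0, len(shuffled) - group_size + 1, group_size)
    ]

    deviations = []
    for group in groups:
        actual = mean([f for f, _ in group])
        hypothetical = ratio * mean([m for _, m in group])
        deviations.append(abs(actual - hypothetical))
    return mean(deviations)


# Invented pupils whose first score is exactly 0.75 of the median, so every
# group average should agree with its hypothetical value.
pupils = [(0.75 * m, float(m)) for m in range(40, 240)]
print(group_average_error(pupils, 25))
```

With real scores the same function could be called with group sizes of 25, 50, and 100 to compare how the error of a group average shrinks as the group grows.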

CONCLUSIONS 

The study was conducted primarily in the hope of learning some- 
thing of the value of standard scores for the measurement and 
understanding of the work of individuals in arithmetic by means of 
comparatively few samples. Test No. 1 of the Courtis series was
chosen for study because averages had been found for it from a large 
number of children and because it was considered comparatively 
elementary. There is, of course, no telling how representative the 
process of adding single digits is. It is probably as simple as any 
process in arithmetic, and the presumption would be that the 
reliability of a single or a few trials in this process would be far 
greater than that of the more complicated processes which make up 
the important practical abilities in the subject. The results of the 
study show that a single or a few trials must be used with caution 
in determining the status of individuals in these abilities.

The unreliability of a few trials may be shown again by reference 
to the standard scores of Courtis based upon the results of the best 
30 per cent of his group of some nine thousand children. His 
averages for this test range from 26 for the third grade up to 65 for 
the ninth grade, a difference of 39 combinations. Many of the 
children of the group studied here made up this amount of differ- 
ence during the twenty-five trials, the average difference between 
the first and last being 28 combinations. Often the difference 
between several grades would be covered in the first two or three 
trials. A child's place on the Courtis scale would thus depend 
largely upon the number of trials he made. 

The question remains as to whether it is possible to use the 
valuable scores of Courtis for the measurement of individuals. 
Because of the unreliability of one score from an individual it is 
clear that such a single score cannot be compared directly with an 
average of Courtis. And of course it is impossible to measure the 
average of several scores of an individual against an average of first
scores of many individuals. If, however, we may assume that the 
curve of improvement based upon 202 returns (Fig. 4) is repre- 
sentative of the rate of improvement of so large a group as the 
1,370 eighth-grade children of Courtis, then we have a basis upon 
which to proceed. The average of first scores of the 202 children 
of this study is 51.5 combinations and the average of the medians of
twenty-five trials is 69. If about the same ratio should exist in the 
larger group of Courtis, which seems likely, the average of medians 
of twenty-five trials to correspond would be 69/51.5 of 57, his 
average of first scores, or 76.4 combinations. We may then say 
that the median of any eighth-grade pupil's twenty-five successive 
trials should approximate 76 combinations to compare favorably 
with what is normal in the country. Of the San Jose group about 
33 per cent made this score. Or, if Courtis' standard score for the 
eighth grade of 63 combinations be used, then the pupil's median 
should approximate 84 combinations. Only 12.5 per cent of the 
San Jose group reached this score. 
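The conversion above is a simple proportion, shown here as a short Python check. The function and parameter names are hypothetical; the figures 51.5, 69, 57, and 63 are those given in the text, and the result holds only under the writers' assumption that the San Jose ratio applies to the Courtis group.

```python
def equivalent_median_standard(first_trial_standard,
                               avg_first=51.5, avg_median=69.0):
    """Scale a first-trial standard up to the level of a median of 25 trials."""
    return first_trial_standard * avg_median / avg_first


# Courtis average of first scores for the eighth grade: 57.
print(round(equivalent_median_standard(57), 1))  # 76.4
# Courtis standard score for the eighth grade: 63.
print(round(equivalent_median_standard(63), 1))  # 84.4
```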

The conclusions of the study require qualification in that it has 
been assumed that a first trial is a definite thing, whereas it is of 
course largely relative to what has preceded it of a similar nature. 
Probably the 202 children of this group, as well as the eighth-grade 
children from whom the Courtis scores were derived, had not had 
practice in adding in the form of the test for several years, although 
the daily work in arithmetic must have given some continuous 
practice incidentally. Children practiced systematically on the 
precise form of the test would probably be more reliably measured 
by one trial if this should be made during or immediately after the 
training. Nothing can be said here of what this increased reliability 
would amount to. It would probably not justify the use of one or 
a few trials for a formal classification of individuals. 

From a certain point of view the usefulness of the Courtis Test 
No. 1 may be questioned. It is plainly not a test of addition even 
when addition is limited to the handling of columns of figures and 
the rational side of the topic is disregarded. As has been noted, the 
ordinary process of addition is much more complicated than writing 
the sums of single digits. The value of the test as a measure of 
addition as usually met with can be determined only after its correlation
with the more inclusive process has been ascertained. It is
doubtful whether the correlation is sufficiently high to warrant the 
use of the test for grading. Undoubtedly the better way would be 
to make use of a test more typical of addition as such. Unfortu- 
nately Courtis does not have a test of addition except as a part of a 
larger one on the four fundamentals. This has the effect of con- 
cealing the standing in any one fundamental process and so makes 
it inconvenient to discover just what special training a pupil needs. 
As an aid in the analysis of backwardness in addition when this has 
been found with the use of a satisfactory grading test, Test No. 1
would seem to be important — just how important can be known 
only after its correlation with adding ability has been determined, 
along with the correlations of the other important components of 
the ability, whatever they are. 

A final word should be said of the ability examined by the test 
on the addition combinations. It has been understood throughout 
the paper as the ability to write with reasonable accuracy and 
intelligibility the sums of the possible combinations of single digits 
with average or normal speed. Accuracy and intelligibility in 
writing have been defined practically by eliminating returns passing 
certain limits, limits which have been described in the matter of 
accuracy. The limit of intelligibility was a rough one. If the 
figures could be definitely made out the papers were accepted. 

But this raises a question. How is the writing of figures related 
to readiness of response in supplying the sums of the combinations 
mentally? Some children are doubtless held back very much by
the difficulty of manipulating the pen or pencil. Others write about 
as fast as they can add. Just how children range between these 
limits is unknown. Perhaps this need not be a matter of concern 
for purposes of grading, for the ability to write combinations is a 
practical one that may be treated as a unit. But in searching for 
causes of backwardness it would be wholly desirable to distinguish 
between standing in writing figures and standing in readiness and 
accuracy of association in the combination of numbers as such. It 
is therefore quite possible that an oral response will prove more 
satisfactory for this purpose. It hardly seems possible that the 
amount of improvement shown by the 202 children of this study is
to be explained in terms of increased readiness of mental association 
alone. In fact, a noticeable depreciation in the quality of the 
writing and other signs of haste arouses the suspicion that a con- 
siderable part of it came from an increased facilitation of a neuro- 
muscular sort in the manipulation of the writing instrument. But 
the determination of these issues must be left for future study. 



